Abstract

The dynamical and stationary properties of on-line learning from finite training sets are analysed 
using the cavity method. For large input dimensions, we derive equations for the macroscopic 
parameters, namely, the student-teacher correlation, the student-student autocorrelation and the 
learning force fluctuation. This enables us to provide analytical solutions to Adaline learning as 
a benchmark. Theoretical predictions of training errors in transient and stationary states are 
obtained by a Monte Carlo sampling procedure. Generalization and training errors are found to 
agree with simulations. The physical origin of the critical learning rate is presented. Comparison 
with batch learning is discussed throughout the paper.

I. INTRODUCTION



In recent years, there have been many attempts to analyse the dynamics of learning
from examples in classification and regression. This refers to the dynamical process of
minimizing the risk functions of the classifier or regressor, often via gradient descent, until
a steady state is reached. Despite progress in understanding the steady-state behavior of
learning processes, the dynamics of learning is much less understood. This is probably
due to the high complexity of its analysis, since it typically involves the evolution of many
microscopic parameters, each strongly interacting with others in a convolutional way. Yet,
a number of important issues in improving the learning efficiency depend on a better
understanding of its dynamics, including the speed of convergence, the early stopping point for
optimal generalization, the shortening of the plateau regime, and the avoidance of getting
trapped in local minima. Hence, it would be both useful and challenging to analyse
the dynamics of learning.

On-line learning is a common mode of implementing learning, in which an independent
example is presented at each learning step. Significant progress has been made in the case
of on-line learning of infinite training sets. Since statistical correlations among
the examples can be ignored, the dynamics can be described by instantaneous dynamical
variables, leading to great advances in our understanding of on-line learning. However,
in reality, the same restricted set of examples is recycled during the learning process. This
introduces temporal correlations of the weights in the learning history, rendering the
infinite-training-set analysis at best an approximation to reality.

There were some attempts to understand on-line learning with recycled examples. Early
researchers used the approximate Fokker-Planck equation to describe the learning process.
The use of perturbative expansions of the master equation was shown to be insufficient
for a precise calculation of global properties of on-line learning [10]. The difference between
batch learning and on-line learning was investigated to the first order of the learning rate [11].
For general learning rates, the exact solution for the Hebbian rule was derived in Ref. [12]. Exact
solutions were found for linear networks, and the generalization ability of on-line learning
was found to outperform batch learning if bias is present in the input [13]. The dynamics
of on-line learning in multilayer neural networks were analysed using the dynamical replica
method and solutions were found in the limit of large sizes of training sets [14].

A recent work based on the generating functional approach is a good step forward toward
a general theory of describing the dynamical and stationary properties of on-line learning
[15]. It illustrates the mean-field character of the dynamics in its description in terms of
an effective single example. For random choices of the sequence of presented examples, the
dynamics is characterized by the appearance of an example as a Poisson event in the learning
sequence. Steady state properties were discussed by neglecting fluctuations in the learning
forces (referred to as the mean-force approximation hereafter).

In this paper, we propose an analysis of on-line learning with recycled examples using the
cavity method. The cavity method is a mean-field analysis first used in magnetic systems
[16]. It enables us to understand the properties of a system by focusing on the response of the
system to a single element added to it. It was later generalized to study learning in neural
networks with the advantages of a clear physical picture and microscopic insights to both
their equilibrium and dynamical properties. The cavity method was subsequently
applied to analyse the dynamics of batch learning, in which the entire set of examples is
provided for each learning step. It provides dynamical equations and obtains important
results on the overtraining, early stopping, noise effects and average learning strategy.

To adapt the cavity method from batch learning to on-line learning in this paper, there
is a need to account for the following subtleties. (a) Averaging over the choice of sequencing
the examples is now necessary. (b) The measurement of an example observed at an instant
is now correlated with the instants when it was learned. This is due to the giant boost of
that example at a learning step, which upsets the uniformity of the examples found in the case
of batch learning.

The purposes of this paper are: (a) to perform an exact analysis of the learning dynamics
as far as the formulation allows, so that minimal approximations are made, and deeper
physical insights can be extracted; (b) to illustrate the analytical approach using the simple
example of a linear learning rule, which can act as a benchmark for verifying the validity
of the theory, and a theoretical framework for more complicated systems, such as nonlinear
learning rules and multilayer networks; (c) to explore efficient Monte Carlo procedures
implied by the distribution of example activations predicted by the theory, which can be
applied to the more complicated cases; (d) to study the difference between on-line learning
and batch learning for general learning conditions.

The paper is organized as follows. In Sec. II we describe the dynamics of on-line learning.
In Sec. III we introduce the cavity method and derive the dynamical equations for the
macroscopic measurements: (a) $G(t,s)$ and $D(t,s)$, the Green's functions of weights and
examples in response to stimuli; (b) $R(t)$, the correlation between the teacher and student
weight vectors, and $C(t,s)$, the autocorrelation between student weight vectors at different
times; (c) the fluctuation of the learning force $\langle F^2(t)\rangle$. The Monte Carlo procedure to
calculate the training error is also presented. In Sec. IV, we compare theoretical predictions
with simulation results. The average learning strategy in the long time limit is proposed
and compared with the performance of batch learning. In Sec. V, we summarize our work
and propose some future directions. In the Appendix, we describe the mathematical details of
the procedure of sequence averaging.

II. FORMULATION 

We consider a training set of $p$ examples generated by a teacher network with $N$ weights
$B_j$, $j = 1, \dots, N$. For definiteness, we set $|\mathbf{B}| = 1$. Each example $\mu = 1, \dots, p$ consists of
an $N$-dimensional input vector $\boldsymbol{\xi}^\mu$ and a teacher-generated output $\tilde{y}_\mu$. It is convenient to
introduce the parameter $\alpha = p/N$. The inputs $\xi_j^\mu$ are Gaussian variables with zero mean
and unit variance. The outputs are

$$ \tilde{y}_\mu = \begin{cases} \mathrm{sgn}(y_\mu + \epsilon z_\mu) & \text{for classification} \\ y_\mu + \epsilon z_\mu & \text{for regression,} \end{cases} $$

where $y_\mu = \mathbf{B}\cdot\boldsymbol{\xi}^\mu$ is the teacher activation, $z_\mu$ is a Gaussian variable with zero mean and
unit variance, and $\epsilon$ is the noise amplitude.

The examples are learned by a student network with the same number of inputs and
output. At each learning step, one example in the training set is randomly drawn. If the
example drawn at time $t$ is $\sigma(t)$, then the weights are modified according to

(1)

where $x_{\sigma(t)}(t) = \mathbf{J}(t)\cdot\boldsymbol{\xi}^{\sigma(t)}$ is the student activation, $v$ is the learning rate and $\lambda$ is the
weight decay. The force $F(x, y)$ describes the learning rule,

$$ F(x, y) = \begin{cases} y & \text{for Hebbian rule} \\ y - x & \text{for Adaline rule} \\ y(\kappa - xy)\Theta(\kappa - xy) & \text{for Adatron rule,} \end{cases} \tag{2} $$

where $\Theta$ is the step function and $\kappa$ is the stability. The last term in Eq. (1) is the dynamical
noise term, often added to avoid the learning procedure being trapped in a local minimum,
with $\langle \eta_j(t)\eta_k(s)\rangle = 2T\delta_{jk}\delta_{ts}/N$, where $T$ is the dynamical noise level.

In the limit of vanishing learning rate $v$, the on-line dynamics described by Eq. (1) is
equivalent to the batch learning formulation when the time scale and the weight decay
in the latter are multiplied by factors of $\alpha/v$ and $1/\alpha$ respectively, with the dynamical
noise rescaled accordingly. However, for finite learning rate $v$, the randomness of the
learning sequence adds noise to the dynamics.
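The formulation above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the update form implied by Eqs. (7) and (16), namely $J_j \to (1 - v\lambda/N)J_j + (v/N)F\xi_j$ with $N$ learning steps per unit time, uses the Adaline force $F = \tilde{y} - x$ with noiseless regression outputs, starts from $\mathbf{J}(0) = 0$, and omits the dynamical noise term ($T = 0$). The function name and default parameters are hypothetical.

```python
import numpy as np

def simulate_online_adaline(N=100, alpha=1.5, v=0.5, lam=0.1, t_max=5.0, seed=0):
    """Return (R, C) = (B.J, J.J) after on-line Adaline learning up to time t_max."""
    rng = np.random.default_rng(seed)
    p = int(alpha * N)                      # training set size, alpha = p/N
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)                  # teacher normalized so that |B| = 1
    xi = rng.standard_normal((p, N))        # Gaussian inputs, zero mean, unit variance
    y = xi @ B                              # noiseless regression outputs (epsilon = 0)
    J = np.zeros(N)                         # assumed initial condition J(0) = 0
    for _ in range(int(t_max * N)):         # assumed clock: N learning steps per unit time
        mu = rng.integers(p)                # draw one example at random (recycled set)
        x = J @ xi[mu]                      # student activation of the drawn example
        # assumed update: weight decay plus Adaline force, no dynamical noise
        J = (1.0 - v * lam / N) * J + (v / N) * (y[mu] - x) * xi[mu]
    return float(B @ J), float(J @ J)

R, C = simulate_online_adaline()
print(R, C)
```

Since $|\mathbf{B}| = 1$, the returned values always satisfy $R^2 \le C$, which gives a quick sanity check on any run.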

III. THE CAVITY METHOD 

A. The cavity activation and the Green's functions 

Consider a new example 0 that is not included in the original training set. We define
its activation at time $t$ in the network trained without that example as its cavity activation
$h_0(t)$, i.e., $h_0(t) = \mathbf{J}(t)\cdot\boldsymbol{\xi}^0$. It is a random variable since the network has not learned the
information of this new example. When the size of the network $N$ is very large, $h_0(t)$ is a
Gaussian variable with mean $R(t)y_0$ and covariance $C(t,s) - R(t)R(s)$, where $R(t) = \mathbf{B}\cdot\mathbf{J}(t)$
is the student-teacher correlation at time $t$, and $C(t,s) = \mathbf{J}(t)\cdot\mathbf{J}(s)$ is the student-student
autocorrelation at times $t$ and $s$. Both $R(t)$ and $C(t,s)$ are self-averaging in the limit of
large $N$.

Now we consider the evolution of another network $J_j^0(t)$ in which the example 0 is added
to its training set. To ensure that the probabilities of occurrence of the new example and
the old ones remain identical, the new example sequence $\sigma^0(t)$ is obtained from the original
example sequence $\sigma(t)$ according to

$$ \sigma^0(t) = \begin{cases} \sigma(t) & \text{with probability } 1 - \frac{1}{p+1} \\ 0 & \text{with probability } \frac{1}{p+1}. \end{cases} \tag{3} $$
In the new example sequence $\sigma^0(t)$, at each learning step, the weight is modified according
to

(4)

where $x^0_{\sigma^0(t)}(t) = \mathbf{J}^0(t)\cdot\boldsymbol{\xi}^{\sigma^0(t)}$. Comparing the networks $J_j^0(t)$ and $J_j(t)$, we obtain from
Eqs. (1) and (4),

(5)

where $S$ is the time shift operator. Let $G^{(0)}(t - t')$ be the bare Green's function

$$ G^{(0)}(t - t') = \Theta\left(t - t' - \frac{1}{N}\right)\left(1 - \frac{v\lambda}{N}\right)^{N(t - t') - 1}. \tag{6} $$

It satisfies

$$ \left(S - 1 + \frac{v\lambda}{N}\right) G^{(0)}(t - t') = \delta_{tt'}. \tag{7} $$

We assume that the adjustment from $J_j(t)$ to $J_j^0(t)$ is small so that linear response theory
is applicable. Then on separating the contributions from the new example and the old ones,
we have



t'<t 

k,t'<t 

where $F_{\sigma(t)}(t)$ is the shorthand notation of the force acting on example $\sigma(t)$ at time $t$, and $F'$
represents the derivative of the force with respect to the activation $x$. We can now interpret
this result from the viewpoint of linear response theory. The first term on the right hand
side describes the primary effects of adding example 0 to the training set and is the driving
term for the difference between the two networks. This occurs at the discrete instants with
$\sigma^0(t') = 0$ by adding the force due to example 0 and removing that due to the original
example $\sigma(t')$. The second term describes the many-body reactions due to the change of the
original examples caused by the added example, and is referred to as the Onsager reaction
term. Describing the response to the driving term by the Green's function, Eq. (8) reduces
to

k,t'<t 
where $G_{jk}(t,s)$ is the time-dependent Green's function with iterative expression

$$ G_{jk}(t,s) = \delta_{jk}G^{(0)}(t - s) + \frac{v}{N}\sum_{t'<t}\sum_{l} G^{(0)}(t - t')\,\xi_j^{\sigma(t')}F'_{\sigma(t')}(t')\,\xi_l^{\sigma(t')}\,G_{lk}(t',s). \tag{10} $$

The Green's function $G_{jk}(t,s)$ is the response of the weight $J_j$ at time $t$ due to a unit
stimulus added at time $s$ to the right hand side of Eq. (1) corresponding to weight $J_k$, in
the limit of vanishing magnitude of the stimulus.

In the limit of large $N$, we can apply a diagrammatic analysis similar to the case of batch
learning. In contrast with batch learning, we need to first average Eqs. (9) and (10) over
the distribution of example sequence using Eq. (3). This can then be followed by the usual
averaging over the distribution of background examples, as in the case of batch learning.
The result is that we can neglect the effect of removing the background example represented
by the second term in the square bracket of the right hand side of Eq. (8). $G_{jk}(t,s)$ is
self-averaging and diagonal in the large $N$ limit, so that $G_{jk}(t,s) = G(t,s)\delta_{jk}$, where $G(t,s)$
satisfies the Dyson's equations

$$ G(t,s) = G^{(0)}(t - s) + v\int dt_1\int dt_2\, G^{(0)}(t - t_1)\langle D_{\sigma(t_2)}(t_1,t_2)F'_{\sigma(t_2)}(t_2)\rangle G(t_2,s), \tag{11} $$

$$ D_{\sigma(s)}(t,s) = \delta(t - s) + \frac{v}{\alpha}\int dt_1\, G(t,t_1)F'_{\sigma(s)}(t_1)D_{\sigma(s)}(t_1,s), \tag{12} $$

where $G^{(0)}(t - s) = \Theta(t - s)\exp[-v\lambda(t - s)]$ is the bare Green's function, $D_{\sigma(s)}(t,s)$ is
the example Green's function, and $\langle\,\cdot\,\rangle$ represents the average over distributions of both example
sequences and examples.

We emphasize that the average $\langle D_{\sigma(t_2)}(t_1,t_2)F'_{\sigma(t_2)}(t_2)\rangle$ is different from the average
$\langle D_\mu(t_1,t_2)F'_\mu(t_2)\rangle$. The former specifies that the functions $F'(t_2)$ and $D(t_1,t_2)$ are due to
the example that was picked from the example sequence for learning at the particular
instant $t_2$. During on-line learning, the activation of this example receives a giant boost at the
learning instant, as mentioned later in the text discussion of Fig. 1. This makes its
distribution different from that of a randomly drawn example $\mu$, whose previous learning instant
remains unspecified. Hence, the former average will be referred to as an active average, in
contrast to the latter, which is referred to as a passive average.

Nevertheless, in the case of linear rules used for illustration later in this paper, $F'(t)$ is
a constant independent of $t$. Hence, the active average in Eq. (11) becomes identical to the
passive average. Thus, the Dyson's equations (11) and (12) become identical to those of
batch learning, after rescaling the time and the weight decay.
In the case of the Hebbian rule, $F'(x) = 0$ and $D_\mu(t,s) = \delta(t - s)$. The Green's function
becomes identical to the bare Green's function.

In the case of the Adaline rule, $F'(x) = -1$ and $D_\mu(t,s) = D(t,s)$ independent of example
$\mu$. The weight Green's function becomes invariant under translation of time, and can be
written as

$$ G(t,s) = G(t - s, 0) = \int \rho(x)e^{-x(t - s)}dx, \tag{13} $$

where $\rho(x)$ is the density of states

$$ \rho(x) = (1 - \alpha)\Theta(1 - \alpha)\delta(x - v\lambda) + \frac{\alpha\sqrt{(x_{\max} - x)(x - x_{\min})}}{2\pi v(x - v\lambda)}, \tag{14} $$

where $x_{\max}$ and $x_{\min}$ are the edges of the spectrum given by $x_{\max}, x_{\min} = v\lambda + v(1 \pm 1/\sqrt{\alpha})^2$,
respectively.
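As a numerical illustration of Eqs. (13) and (14), the sketch below evaluates spectral averages over the density of states and the Adaline weight Green's function. It assumes the continuous part of the density has the Marchenko-Pastur form $\alpha\sqrt{(x_{\max} - x)(x - x_{\min})}/[2\pi v(x - v\lambda)]$ between the stated spectrum edges; the function names and the quadrature scheme are illustrative choices, not taken from the paper.

```python
import numpy as np

def spectral_average(f, alpha, v, lam, n=20000):
    """Approximate the integral of rho(x) f(x) dx; f must accept NumPy arrays."""
    s = 1.0 / np.sqrt(alpha)
    x_min = v * lam + v * (1.0 - s) ** 2
    x_max = v * lam + v * (1.0 + s) ** 2
    centre, half = 0.5 * (x_max + x_min), 0.5 * (x_max - x_min)
    # Substituting x = centre + half*cos(theta) turns the square root into
    # half*sin(theta), so a midpoint rule on theta in (0, pi) converges quickly.
    theta = (np.arange(n) + 0.5) * np.pi / n
    x = centre + half * np.cos(theta)
    weight = alpha * (half * np.sin(theta)) ** 2 / (2.0 * np.pi * v * (x - v * lam))
    continuous = np.sum(weight * f(x)) * np.pi / n
    # Delta peak at x = v*lam, present only for alpha < 1.
    delta = (1.0 - alpha) * f(np.array([v * lam]))[0] if alpha < 1.0 else 0.0
    return continuous + delta

def green_function(t, alpha, v, lam):
    """G(t, 0) of Eq. (13) for t >= 0."""
    return spectral_average(lambda x: np.exp(-x * t), alpha, v, lam)

print(green_function(0.0, 2.0, 0.5, 0.1))   # normalization of rho
print(green_function(1.0, 2.0, 0.5, 0.1))   # decays with time
```

Two checks follow from the assumed density: it integrates to one for both $\alpha > 1$ and $\alpha < 1$, and its mean is $v(1 + \lambda)$.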

The number of times $m$ that the new example appears in time $t$ follows a Poisson
distribution with mean $t/\alpha$. If these appearances occur at times $t_1, \dots, t_m$ ($t_m < \dots < t_1 < t$),
Eq. (9) reduces to

$$ J_j^0(t) = J_j(t) + \frac{v}{N}\sum_{r=1}^{m} G(t,t_r)F_0(t_r)\xi_j^0. \tag{15} $$

Multiplying both sides by $\xi_j^0$ and summing over $j$, one derives the relationship between the
cavity activation and the generic activation of example 0,

$$ x_0(t) = h_0(t) + v\sum_{r=1}^{m} G(t,t_r)F_0(t_r). \tag{16} $$

This relation enables us to express the cavity activation $h(t)$ of any example as a function
of its generic activations $x(t_1), \dots, x(t_m); x(t)$ at the previous and current learning instants,
and attributes physical meaning to the single effective example in Ref. [15]. Hereafter, we
omit the subscript if no confusion occurs.

The simulation results in Fig. 1 verify the relationship between the cavity activation
and the generic activation for a randomly selected example. Up to $t = 3$, the example is
drawn from the learning sequence $\sigma(t)$ 9 times, close to the Poisson average of $t/\alpha = 10$.
The solid line describes the evolution of $x(t)$, which exhibits giant boosts at the 9 learning
instants indicated by the vertical dashed lines. The dotted line describes the evolution of the
cavity activation $h(t)$, which is obtained in a second network which uses the same learning
sequence $\sigma(t)$, except that learning is paused when the example is drawn. Since the example
and this network are uncorrelated, $h(t)$ evolves as a random walker with appropriate means
and covariances. The filled circles indicate the values of the cavity activations predicted by
Eq. (16), using the Green's functions measured by comparing learning with and without
stimuli [18]. They show remarkable agreement with the simulated $h(t)$.

FIG. 1: The evolution of the activations of a randomly selected example in a network with $N =
1000$, $\alpha = 0.3$, $\kappa = 0.8$, $v = 0.1$ and $\lambda = 0.1$ using the Adatron rule. See text discussions for the
explanations of the lines and symbols.

To derive the distribution of generic activations, we first consider the distribution of
cavity activations, which is given in the Gaussian form at the $m$ learning instants and time $t_0 (= t)$
by

$$ P(h(t_0), \dots, h(t_m)|y) = \frac{\exp\left\{-\frac{1}{2}\sum_{i,j=0}^{m}[h(t_i) - R(t_i)y](C - RR^T)^{-1}_{ij}[h(t_j) - R(t_j)y]\right\}}{\sqrt{(2\pi)^{m+1}\det(C - RR^T)}}, \tag{17} $$

where $C - RR^T$ is a square matrix with size $m + 1$ and $(C - RR^T)_{ij} = C(t_i,t_j) - R(t_i)R(t_j)$.

The corresponding distribution of generic activations can be written as

$$ P(x(t_0), \dots, x(t_m)|\tilde{y},y) = P(h(t_0), \dots, h(t_m)|y)\left|\frac{\partial(h(t_0), \dots, h(t_m))}{\partial(x(t_0), \dots, x(t_m))}\right|, \tag{18} $$

where $h(t_i)$ is a function of $x(t_i), \dots, x(t_m)$ defined by Eq. (16), and the dependence on $\tilde{y}$
may arise from the learning forces. Since $\partial h(t_i)/\partial x(t_j) = 0$ for $t_j > t_i$, and $\partial h(t_i)/\partial x(t_i) = 1$,
the Jacobian reduces to 1. Therefore, the distribution of generic activations can be expressed
as

$$ P(x(t_0), \dots, x(t_m)|\tilde{y},y) = \frac{\exp\left\{-\frac{1}{2}\sum_{i,j=0}^{m}[h(t_i) - R(t_i)y](C - RR^T)^{-1}_{ij}[h(t_j) - R(t_j)y]\right\}}{\sqrt{(2\pi)^{m+1}\det(C - RR^T)}}. \tag{19} $$

In general, $h(t_i)$ can be a nonlinear function of $x(t_i), \dots, x(t_m)$. Hence, the generic activation
distribution in Eq. (19) is no longer Gaussian, although the cavity activation distribution in
Eq. (17) is. This characteristic of on-line learning is demonstrated in numerical simulation
for the Adatron rule in Ref. [22].

We now illustrate how the above result can be applied to specific cases. For the Hebbian
rule, Eq. (16) implies that $h(t)$ is not an explicit function of $x(t_r)$ at the previous learning
instants $t_r$,

$$ h(t) = x(t) - v\tilde{y}\sum_{r=1}^{m}\exp[-v\lambda(t - t_r)]. \tag{20} $$

This enables us to write down the instantaneous activation distribution, given the learning
instants $t_1, \dots, t_m$ of the example,

$$ P(x,t|\tilde{y},y;t_1, \dots, t_m) = \frac{\exp\left\{-\frac{1}{2}[C(t,t) - R^2(t)]^{-1}\left[x(t) - v\tilde{y}\sum_{r=1}^{m}e^{-v\lambda(t - t_r)} - R(t)y\right]^2\right\}}{\sqrt{2\pi[C(t,t) - R^2(t)]}}. \tag{21} $$

The distribution is then averaged over the time distribution and the Poisson distribution of
learning instants

$$ P(x,t|\tilde{y},y) = \langle P(x,t|\tilde{y},y;t_1, \dots, t_m)\rangle_\sigma. \tag{22} $$

The sequence average of an instantaneous quantity $\psi$ at time $t$ depending on the previous
learning instants $t_1, \dots, t_m$ is

$$ \langle\psi(t|t_1, \dots, t_m)\rangle_\sigma = \sum_{m=0}^{\infty}\frac{e^{-t/\alpha}}{\alpha^m}\int_0^t dt_1\int_0^{t_1}dt_2\cdots\int_0^{t_{m-1}}dt_m\,\psi(t|t_1, \dots, t_m), \tag{23} $$

where the factor of $m!$ in the Poisson distribution is cancelled by the number of
permutations in ordering $t_1, \dots, t_m$. Using the Hubbard-Stratonovich identity, we can factorize the
integrals over $t_r$ ($1 \le r \le m$). We arrive at the result

$$ P(x,t|\tilde{y},y) = \int\frac{d\hat{x}}{2\pi}\exp\left\{i\hat{x}[x - R(t)y] - \frac{1}{2}[C(t,t) - R^2(t)]\hat{x}^2 + \frac{1}{\alpha}\int_0^t ds\left[\exp\left(-i\hat{x}v\tilde{y}e^{-v\lambda(t - s)}\right) - 1\right]\right\}, \tag{24} $$

which agrees with the rule-specific derivation in Ref. [12].

For the Adaline rule, substituting $F(x,y) = y - x$ into Eq. (16) yields a linear relation between
the student activation and the cavity ones,

$$ x(t_0) = \sum_{r=0}^{m}(1 + vG)^{-1}_{0r}\left[h(t_r) + v\sum_{s=r+1}^{m}G_{rs}\tilde{y}\right], \tag{25} $$

where $G$ is a square matrix with size $m + 1$ and $G_{rs} = G(t_r - t_s, 0)$ for $t_r > t_s$ and $G_{rs} = 0$
for $t_r \le t_s$. Inserting the mean and variance of the cavity activation, we see that $x(t_0)$ is a
Gaussian variable with mean and variance

$$ \langle x(t_0)\rangle = \sum_{r=0}^{m}(1 + vG)^{-1}_{0r}\left[R(t_r)y + v\sum_{s=r+1}^{m}G_{rs}\tilde{y}\right], $$

$$ \Delta^2(t_0) = \sum_{r,s=0}^{m}(1 + vG)^{-1}_{0r}(1 + vG)^{-1}_{0s}[C(t_r,t_s) - R(t_r)R(t_s)]. \tag{26} $$

To obtain the activation distribution in such an application as the average training error, we
need to further average the Gaussian distribution in Eq. (26) over the learning sequences.
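For a fixed set of learning instants, the Adaline mean and variance can be evaluated with a small linear solve. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, `G`, `R` and `C` are user-supplied callables standing in for the Green's function and correlations, and the instants are ordered $t_0 > t_1 > \dots > t_m$ so that the matrix $G$ is strictly upper triangular.

```python
import numpy as np

def adaline_activation_stats(times, y, y_out, v, G, R, C):
    """Mean and variance of x(t_0) for given instants, following the Adaline relations.

    times : t_0 > t_1 > ... > t_m (observation time first, then learning instants)
    y, y_out : teacher activation and teacher output of the example
    G(tau) : weight Green's function G(tau, 0) for tau > 0
    R(t), C(t, s) : student-teacher correlation and student autocorrelation
    """
    times = np.asarray(times, dtype=float)
    n = len(times)
    Gm = np.zeros((n, n))
    for r in range(n):
        for s in range(r + 1, n):                  # t_r > t_s exactly when s > r
            Gm[r, s] = G(times[r] - times[s])
    row0 = np.linalg.inv(np.eye(n) + v * Gm)[0]    # first row of (1 + vG)^{-1}
    drive = v * Gm.sum(axis=1) * y_out             # v * sum_s G_rs * y_out
    mean_h = np.array([R(t) * y for t in times])   # cavity means R(t_r) y
    cov_h = np.array([[C(a, b) - R(a) * R(b) for b in times] for a in times])
    mean_x = row0 @ (mean_h + drive)
    var_x = row0 @ cov_h @ row0
    return float(mean_x), float(var_x)
```

Because the relation between $x$ and $h$ is linear, the mean returned here coincides with what one obtains by propagating the cavity means through Eq. (16) instant by instant, starting from the earliest learning event.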

In general, for nonlinear learning rules, the linear inversion of Eq. (16) to obtain the
student activation is not possible, and the activation distribution becomes non-Gaussian,
even for a given sequence of learning instants. Nevertheless, a useful identity exists for the
sequence average pertaining to an example, as derived in Appendix A,

$$ \langle x(t)\rangle_\sigma = h(t) + \frac{v}{\alpha}\int_0^t dt'\,G(t,t')\langle F(t')\rangle_\sigma. \tag{27} $$

This equation of the sequence-averaged activation is the same as that of the self-consistent
equation of the activation in batch learning, after rescaling the time and weight decay.

B. The student-teacher correlation 



To analyse the student-teacher correlation, we multiply both sides of Eq. (1) by $B_j$ and
sum over $j$, yielding in the limit of large $N$,

$$ \left(\frac{d}{dt} + v\lambda\right)R(t) = v\langle y_\mu\langle F_\mu(t)\rangle_\sigma\rangle_\mu, \tag{28} $$

where $\langle\cdot\rangle_\mu$ represents averaging over the distribution of examples.

For the Adaline rule, the solution of $R(t)$ in Eq. (28) involves $\langle x_\mu(t)\rangle_\sigma$. By virtue of Eq. (27)
and exploiting the example Green's function in Eq. (12), we obtain

$$ \langle x_\mu(t)\rangle_\sigma = \int_0^t dt_1\,D(t - t_1)\left[h_\mu(t_1) + \frac{v}{\alpha}\int_0^{t_1}dt_2\,G(t_1 - t_2)\tilde{y}_\mu\right]. $$

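Applying the Laplace transform to this result and then to Eq. (28),

$$ \langle x_\mu(z)\rangle_\sigma = D(z)\left[h_\mu(z) + \frac{v}{\alpha z}G(z)\tilde{y}_\mu\right], \tag{29} $$

$$ (z + v\lambda)R(z) = v\left\{\frac{a}{z} - D(z)\left[R(z) + \frac{va}{\alpha z}G(z)\right]\right\}, \tag{30} $$

where $a = \sqrt{2/[\pi(1 + \epsilon^2)]}$ for classification and $a = 1$ for regression.
Here $G(z) = \int_0^\infty dt\,e^{-zt}G(t,0)$ and $D(z) = \int_0^\infty dt\,e^{-zt}D(t,0)$. Inverse Laplace transform
yields

$$ R(t) = a\int dx\,\rho(x)(x - v\lambda)\frac{1 - e^{-xt}}{x}, \tag{31} $$

which is the same as that in batch learning after rescaling.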



C. The student-student autocorrelation and the force fluctuation 

To analyse the student-student autocorrelation, we multiply both sides of Eq. (1) by
$J_j(s)$ and sum over $j$, thus obtaining in the limit of large $N$,

$$ \left(\frac{\partial}{\partial t} + v\lambda\right)C(t,s) = v\langle\langle x_\mu(s)F_\mu(t)\rangle_{\sigma|\sigma(t)=\mu}\rangle_\mu. \tag{32} $$

We remark that $\langle x_\mu(s)F_\mu(t)\rangle_{\sigma|\sigma(t)=\mu}$ is an active average, which is distinct from the passive
average $\langle x_\mu(s)F_\mu(t)\rangle_\sigma$, where the example learned at time $t$ is not necessarily $\mu$.

For the Adaline learning rule, the average difference $W_\mu(s,t) = \langle x_\mu(s)F_\mu(t)\rangle_{\sigma|\sigma(t)=\mu} -
\langle x_\mu(s)F_\mu(t)\rangle_\sigma$ can be expressed in terms of Green's functions and learning force,

$$ W_\mu(s,t) = v\int dt'\,D(s,t')G(t',t)\langle F_\mu^2(t)\rangle_\sigma, \tag{33} $$

as shown in the Appendix. A similar equation for the passive average is also derived therein,



(x^(s)F^(t)), = / dt,D{s,h] 
'o 

2 /■min{t,s) 



a 



a 







/ dt2 / AhD{sM)G{t2M)D{tM)G{hM){Fl{U)),. 

(34) 



Therefore, one can perform the Laplace transforms 



\{F,{w)F,{z))^) 



OO j'OO 

C{w,z) = I ds dte-'"'~''C{s,t) 
Jo 

OO POO 

ds / dte-""'-'' 
Jo 



After substituting Eqs. (33) and (34) into Eq. (32), and performing elaborate algebra, one
obtains an equation of the weight correlation



C{w,z) 



2T 



X 



W2; azw 
G{z)D{z) -G{w)D{w 



{w + z)D{w)D{z) 



w — z 

and an equation of the force autocorrelation 



(35) 



{{F,{wmz)).)^ = - ^^GMb^wmz) - —G{z)D{wmz) 



wz 



wz 



wz 



+D{z)D{w)C{w, z) + -G{w)G{z)D{w)b{z){{Fl{w + z)),)^ 



(36) 



Here $b = 1$ for classification and $b = 1 + \epsilon^2$ for regression.

We note the presence of the force fluctuation term $\langle\langle F_\mu^2(w + z)\rangle_\sigma\rangle_\mu$ in Eqs. (35) and (36).
This term is absent in the corresponding equations in the case of batch learning. This can be
seen by observing the scalings $z^{-1} \sim G(z) \sim \langle\langle F_\mu^2(w + z)\rangle_\sigma\rangle_\mu \sim v^{-1}$ in Eqs. (35) and (36), so
that the coupling of the weight correlation $C(w,z)$ and the force autocorrelation $\langle\langle F_\mu(w)F_\mu(z)\rangle_\sigma\rangle_\mu$
via the force fluctuation $\langle\langle F_\mu^2(w + z)\rangle_\sigma\rangle_\mu$ will approach zero when $v$ is vanishingly small,
which indicates that this temporal correlation is unique to on-line learning. In contrast to
batch learning, the presence of the force fluctuation term increases the weight correlation
via Eq. (35), which in turn increases the force fluctuation itself via Eq. (36). As we shall
see, this coupling leads to a collective relaxation mode.

Performing the inverse Laplace transform, one can obtain the autocorrelation

$$ C(s,t) = \int dx\,\rho(x)(x - v\lambda)\left[\frac{bv}{\alpha} + a^2\left(x - v\lambda - \frac{v}{\alpha}\right)\right]\frac{(1 - e^{-xt})(1 - e^{-xs})}{x^2} $$
$$ \qquad + v\int dx\,\rho(x)(x - v\lambda)\int_0^{\min(t,s)}dt'\,e^{-x(t + s - 2t')}\langle F^2(t')\rangle + T\int dx\,\rho(x)\frac{e^{-x|t - s|} - e^{-x(t + s)}}{x}, \tag{37} $$

where $\langle F^2(t)\rangle = \langle\langle F_\mu^2(t)\rangle_\sigma\rangle_\mu$. In Eq. (37), the first and third terms are similar to those in
the autocorrelation function in batch learning. The second term represents the contribution
induced by the force fluctuations arising from on-line learning. It vanishes when $v \to 0$.
However, for finite learning rate $v$, one can see that this term plays a similar role as the
dynamical noise. This can be seen by considering the asymptotic limit of the second term,
where $\langle F^2(t')\rangle$ approaches the steady-state value of $\langle F^2\rangle$, yielding

$$ \frac{v\langle F^2\rangle}{2}\int dx\,\rho(x)(x - v\lambda)\frac{e^{-x|t - s|}}{x}. \tag{38} $$

Comparing with the dynamical noise in the third term, we see that $\langle F^2\rangle$ is a measure of an
effective temperature, and the two noise contributions differ slightly in their spectrum
of relaxation rates. Therefore, in practice, force fluctuations can also assist the learning
dynamics in avoiding being trapped in metastable states, playing the same role as dynamical
noises in batch learning. Hereafter, we let $T = 0$ in our final results.

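To obtain the force autocorrelation we substitute Eq. (35) into Eq. (36)
and perform the inverse Laplace transform, which yields

$$ \langle\langle F_\mu(t)F_\mu(s)\rangle_\sigma\rangle_\mu = b - \frac{1}{v}\int dx\,\rho(x)\left[\frac{bv}{\alpha} + a^2\left(x - v\lambda - \frac{v}{\alpha}\right)\right]\left\{1 - \frac{[v\lambda + (x - v\lambda)e^{-xt}][v\lambda + (x - v\lambda)e^{-xs}]}{x^2}\right\} $$
$$ \qquad + \int_0^{\min(t,s)}dt'\,\langle F^2(t')\rangle\int dx\,\rho(x)(x - v\lambda)^2e^{-x(t + s - 2t')}. \tag{39} $$

Equating $t$ and $s$, the force fluctuation is given by the inverse Laplace transform of

$$ \frac{\dfrac{b}{z} + \dfrac{1}{v}\displaystyle\int dx\,\rho(x)\frac{x - v\lambda}{x^2}\left[\frac{bv}{\alpha} + a^2\left(x - v\lambda - \frac{v}{\alpha}\right)\right]\left[\frac{x - v\lambda}{z + 2x} + \frac{2v\lambda}{z + x} - \frac{x + v\lambda}{z}\right]}{1 - \displaystyle\int dx\,\rho(x)\frac{(x - v\lambda)^2}{z + 2x}}. \tag{40} $$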
The final expression consists of four contributions. First, the pole at $z = 0$ gives the steady
state value of

$$ \langle F^2\rangle = \frac{b - \dfrac{1}{v}\displaystyle\int dx\,\rho(x)\left(1 - \frac{v^2\lambda^2}{x^2}\right)\left[\frac{bv}{\alpha} + a^2\left(x - v\lambda - \frac{v}{\alpha}\right)\right]}{1 - \displaystyle\int dx\,\rho(x)\frac{(x - v\lambda)^2}{2x}}. \tag{41} $$

The second and third contributions come from the relaxation spectrum of force fluctuations
ranging through $(x_{\min}, x_{\max})$ and $(2x_{\min}, 2x_{\max})$, respectively. The fourth contribution is
described by the existence of a collective relaxation mode arising from the force-weight
coupling, which is a novel phenomenon of on-line learning. Its relaxation rate is given by
the pole

$$ 1 - \int dx\,\rho(x)\frac{(x - v\lambda)^2}{z + 2x} = 0. \tag{42} $$

This has been referred to as the slow mode in earlier work. When this relaxation rate approaches
zero, the steady-state force fluctuation and student weight will diverge. The critical learning
rate at which the weight diverges is given by

$$ v_c = \frac{4}{2 - \lambda\left[1 + \alpha + \alpha\lambda - \sqrt{(1 + \alpha + \alpha\lambda)^2 - 4\alpha}\right]}, \tag{43} $$

which is also derived through spectral analysis.

D. The training and generalization errors 

The performance of learning is measured by the training and generalization errors. Here
we provide their expressions for noiseless examples. Expressions for other cases can be derived
similarly. In classification, the generalization error is defined as the probability that a new
example presented to the network is misclassified, $E_g(t) = \langle\Theta[-(\mathbf{B}\cdot\boldsymbol{\xi})(\mathbf{J}(t)\cdot\boldsymbol{\xi})]\rangle_{\boldsymbol{\xi}}$. It
is determined by the magnitude of the student vector $C(t,t)$ and its correlation with the
teacher vector $R(t)$, that is,

$$ E_g(t) = \frac{1}{\pi}\cos^{-1}\frac{R(t)}{\sqrt{C(t,t)}}. \tag{44} $$

Analytical expressions of $R(t)$ and $C(t,t)$ are derived in previous subsections for the Adaline
rule.
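The generalization error of Eq. (44) is a one-line computation once $R(t)$ and $C(t,t)$ are known. The helper below is a minimal sketch with a hypothetical name; it assumes $|\mathbf{B}| = 1$ as in Sec. II and clips the cosine only to guard against rounding.

```python
import numpy as np

def generalization_error(R, C_tt):
    """E_g = arccos(R / sqrt(C)) / pi for a noiseless teacher with |B| = 1."""
    cos_angle = np.clip(R / np.sqrt(C_tt), -1.0, 1.0)   # guard against rounding
    return float(np.arccos(cos_angle) / np.pi)

print(generalization_error(0.8, 1.0))
```

A student perfectly aligned with the teacher gives zero error, while an uncorrelated student gives an error of one half.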
The training error is defined as the fraction of examples in the training set that are
classified wrongly, i.e.

$$ E_t(t) = \int d\tilde{y}\int dy\,P(\tilde{y},y)\sum_{m=0}^{\infty}\frac{e^{-t/\alpha}}{\alpha^m}\int_0^t dt_1\cdots\int_0^{t_{m-1}}dt_m\int dx(t)\,P(x(t)|\tilde{y},y;t_1, \dots, t_m)\Theta[-\tilde{y}x(t)]. \tag{45} $$
This can be computed by a Monte Carlo sampling procedure, which has been shown to be
free from finite size effects. For general learning rules, we adopt the procedure with the
following steps:

1) For a given training example, generate the teacher activation $y$ and output $\tilde{y}$ according to
$P(\tilde{y},y)$.

2) For time $t_0 (= t)$, generate the number of times $m$ the example appears in a training
sequence from time 0 to $t_0$ according to a Poisson distribution with mean $t_0/\alpha$.

3) Generate the instants $t_1, \dots, t_m$ that the example appears in the training sequence
according to a uniform distribution between 0 and $t_0$, with $0 < t_m < \dots < t_1 < t_0$.

4) Generate the cavity activations $h(t_r)$, $r = 0, \dots, m$ according to the Gaussian
distribution with mean $R(t_r)y$ and covariance $C(t_r,t_s) - R(t_r)R(t_s)$. This can be carried out by
generating the independent Gaussian variables $z_k$ ($k = 0, \dots, m$) with mean 0 and variance
1, and transforming them to $h(t_i)$ via

$$ h(t_i) = R(t_i)y + \sum_{k=i}^{m}A_{ik}z_k, \qquad 0 \le i \le m, \tag{46} $$

and the matrix elements are obtained from the recursion relations 
and for 1 < j < i < m, 

Ar 



C(trn—iitm—j) R{tm—i)Ritm—j) ^m— i.m— fc^m— I'.m— fc 



'■m—i,m—j < 

2 



A 



C {trn—i,m—i) R (tm—i) ^ ^ ^ 



i-1 

2 

m—i,m—k 

k=l 



(48) 



5) Compute $x(t_i)$ according to Eq. (16). This enables us to collect samples for the
distribution $P(x(t_0)|\tilde{y},y)$, and hence estimate the training error.

6) Steps 1) to 5) are repeated to yield sufficient amount of statistics. 
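Steps 2) to 5) above can be sketched for a single example as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, `R`, `C`, `G` and `F` are user-supplied callables standing in for the correlations, the weight Green's function and the learning force, the teacher output is taken equal to the teacher activation (noiseless regression), and NumPy's built-in multivariate normal sampler replaces the explicit triangular transformation of step 4).

```python
import numpy as np

def sample_activations(t0, alpha, v, R, C, G, F, y, rng):
    """Draw one sample of cavity and generic activations for one example.

    Returns (times, h, x) with times[0] = t0 followed by the learning
    instants in decreasing order.
    """
    m = rng.poisson(t0 / alpha)                              # step 2: number of appearances
    instants = np.sort(rng.uniform(0.0, t0, size=m))[::-1]   # step 3: ordered uniform instants
    times = np.concatenate(([t0], instants))
    mean = np.array([R(t) * y for t in times])               # step 4: Gaussian cavity activations
    cov = np.array([[C(a, b) - R(a) * R(b) for b in times] for a in times])
    h = rng.multivariate_normal(mean, cov)
    x = np.empty(m + 1)
    for i in range(m, -1, -1):                               # step 5: earliest instant first
        # boosts from all earlier learning events of this example (those with t_r < t_i)
        boost = sum(G(times[i], times[r]) * F(x[r], y) for r in range(i + 1, m + 1))
        x[i] = h[i] + v * boost
    return times, h, x
```

Repeating the call for many draws of `y` and of the random sequence, and recording `x[0]`, gives samples of the generic activation at time `t0`; the fraction with the wrong sign estimates the training error for classification.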
For Adaline learning, step 5) of the Monte Carlo sampling procedure can be further
simplified by exploiting the Gaussian nature of the generic activation distribution. Using
Eq. (26) to find the mean $\langle x(t_0)\rangle$ and the variance, denoted $\sigma^2(t_0)$ here, the contribution of an
example to the training error is given by $\frac{1}{2}\mathrm{erfc}\left(\langle x(t_0)\rangle\tilde{y}/\sqrt{2}\sigma(t_0)\right)$.

The above procedure assumes that the Green's function $G(t,s)$ and the correlations $R(t)$
and $C(t,s)$ are known a priori. For general learning rules, $G(t,s)$ can be obtained by solving
the Dyson's equations, Eqs. (11) and (12). Since the equation involves an average over the
distribution of learning sequences and examples, it can again be obtained by a Monte Carlo
sampling procedure. Similarly, $R(t)$ and $C(t,s)$ can be obtained by solving Eqs. (28) and
(32) by Monte Carlo sampling. These will be left for further studies. Here, for the exposition
of the cavity method, we focus on Adaline learning, where these functions can be obtained
analytically as described in the previous subsections.

For the purpose of consistency check, one can also obtain these functions directly from
simulations and plug them into the Monte Carlo sampling procedure to check whether the
generated activation distribution agrees with simulations.

The proposed Monte Carlo procedure is similar to the effective single pattern process
in Ref. [15]. The difference lies in the generation of the learning instants of an example
according to the Poisson distribution. In Ref. [15], an individual Poisson number with mean
$\Delta/\alpha$ is generated for every time increment $\Delta$ in the learning history. Here, the sampling
efficiency is improved by a single Poisson number $m$ with mean $t_0/\alpha$ and $m$ learning instants
with uniform distributions. For general learning rules, even if the Green's functions and
correlation functions have to be generated from the Monte Carlo sampling procedure, it is
possible to use similar efficient samplings. This will be left for further studies.

While the Monte Carlo sampling procedure is useful in studying the transient behavior of 
learning, it can also be used to extract the stationary properties. At a very large observation 
time t, we look back at the learning history of an example by the network. Reasonably, only 
those learning events occur recently have detectable contributions to the distribution of the 
student activation x(t) in Eq. (pGl). Therefore, we can calculate this distribution to any 
desired precision by adding the earlier learning events one by one, until certain stopping 
criteria are satisfied. Since the time intervals between successive learning events obey an 
exponential distribution, we have

$$ P(x|y,y) = \lim_{t\to\infty}\lim_{m\to\infty}\prod_{r=1}^{m}\left[\int_0^{\infty}\frac{ds_r}{\alpha}\,e^{-s_r/\alpha}\right] P\Big(x,t\,\Big|\,y,y;\,t-s_1,\dots,t-\sum_{r=1}^{m}s_r\Big). \tag{49} $$

For Adaline learning, the distribution $P(x,t|y,y;\,t-s_1,\dots,t-\sum_{r}s_r)$ is replaced by a
Gaussian distribution with mean and variance given in Eqs. (26). We find that the
contribution of earlier events approaches zero very quickly as $m$ increases. Thus, we only need to
invert small matrices in evaluating the training error at steady state.
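
The backward construction of the learning history used for the steady state can be sketched as follows. This is an illustrative sketch only: the name `lookback_ages` and the fixed cutoff `horizon` are assumptions standing in for the unspecified stopping criteria, and the intervals between successive learning events are taken to be exponential with mean $\alpha$.

```python
import random


def lookback_ages(alpha, horizon, rng):
    """Ages (time before the observation time) of the most recent learning
    events of one example, added one by one going back in time.

    Assumptions: successive intervals are exponential with mean alpha, and
    events older than `horizon` are neglected (a hypothetical stopping rule).
    """
    ages, elapsed = [], 0.0
    while True:
        # expovariate takes the rate, i.e. the inverse of the mean interval.
        elapsed += rng.expovariate(1.0 / alpha)
        if elapsed > horizon:
            return ages
        ages.append(elapsed)


rng = random.Random(0)
print(lookback_ages(alpha=1.2, horizon=12.0, rng=rng))
```

In practice the cutoff would be chosen from the decay of the contributions of earlier events rather than fixed in advance.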



IV. RESULTS AND DISCUSSIONS 



Figure 2(a) shows the transient behavior of the three macroscopic parameters, namely the
student-teacher correlation, the student autocorrelation and the force fluctuation, for typical
learning parameters. The training error and generalization error are shown in Fig. 2(b). The
theoretical predictions are in excellent agreement with simulations.

Figure 3 shows the generalization and training errors at the steady state. The theoretical 
predictions agree well with the simulation results. The learning dynamics diverges at the
critical learning rate $v_c$. It is also observed that strong weight decay tends to restrain this
divergence at large learning rates, pushing $v_c$ to higher values. On the other hand, strong
weight decay increases the generalization error when $v$ is small.

In Fig. 3, we also present the results obtained by the mean- force approximation adopted 
in Ref . . As we have shown above, in steady state, the sequence noise due to the random 



drawing of examples at each learning step is equivalent to the external dynamical noise
in batch learning. Thus, the dynamical variables will fluctuate around their temporal
averages even without other external noise. As a result, the mean-force approximation is
only valid when the learning rate and the sequence noise are small. As shown in Fig. 3, its
discrepancy with simulations increases as $v$ becomes large. The critical learning
rate $v_c = 2(1 + \lambda)$ estimated by the mean-force approximation is larger than the simulation
result. This discrepancy can be attributed to the omission of the force fluctuations therein.

An important question is whether on-line learning can perform as well as batch learning 
|pO| , We have proposed an averaged strategy in the context of batch learning, predicting 
that dynamical noise can be averaged over to yield performances approaching noiseless 
learning 0]. Since sequence noise in on-line learning has similar effects as dynamical noise, 
we adopt the same strategy to improve the performance of on-line learning.

FIG. 2: The evolution of (a) the teacher-student correlation $R(t)$, the student autocorrelation $C(t,t)$ and
the force fluctuation $\langle F^2(t)\rangle$, and (b) the training error $E_t$ and the generalization error $E_g$ for Adaline learning
at $\alpha = 1.2$, $v = 1.9$, and $\lambda = 0.8$. Solid lines: the cavity method, with analytical results for
$R(t)$, $C(t,t)$, $\langle F^2(t)\rangle$, $E_g$ and Monte Carlo results for $E_t$ averaged over 500,000 samples. Symbols:
simulations averaged over 100 samples with $N = 500$.

In average learning, we first wait for the system to settle to the steady state, and then
monitor the student weight vector for an extended period of time. We use the weight
vector averaged over the monitoring period $\tau$, $\bar{\mathbf{J}} = \lim_{t\to+\infty}\frac{1}{\tau}\int_t^{t+\tau}\mathbf{J}(t')\,dt'$, as the estimated
student vector. The average weight amplitude is given by

$$ \bar{C}(\tau) = \lim_{t\to\infty}\frac{1}{\tau}\int_0^{\tau}dt'\,C(t,t+t'). \tag{50} $$
FIG. 3: The dependence of (a) the training error $E_t$ and (b) the generalization error $E_g$ on the learning
rate $v$ for $\alpha = 0.5$. Solid lines: steady-state results of the cavity method for $\lambda = 0.4$. Dashed lines:
steady-state results of the cavity method for $\lambda = 0.8$. Corresponding symbols: simulations averaged
over 100 samples with $N = 1{,}000$ and $t = 20$. Dotted lines: the mean-force approximation for
$\lambda = 0.8$.



Using Eq. {p7\), we obtain 

— f X — vX 

C{t) = I dxp(x' 



2 I ^ 

\- a [x — vA — 



a \ a 



where $\langle F^2\rangle$ is the force fluctuation derived earlier. We note that when $\tau$ becomes very large, the contribution
due to force fluctuations vanishes, so that $\bar{C}(\tau)$ approaches the result for batch learning.
Hence, the average learning strategy yields a generalization performance as good as that of
the batch mode, independent of the learning rate, as long as it is below the critical value
for divergent learning. Figure 4 shows that the generalization error of on-line learning is
equal to that of the batch mode at $v = 0$, and gradually grows larger when $v$ increases.
When the monitoring time $\tau$ increases, there is an impressive reduction of the
generalization error compared with its instantaneous values. The longer the monitoring time $\tau$, the
smaller the generalization error. In the limit of very long monitoring time, the generalization
ability becomes equal to that of the batch mode for all values of $v$ below $v_c$, and jumps
discontinuously to divergence at $v_c$.
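
The average learning strategy discussed above can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not the authors' simulation code: the function name, the update rule $\Delta\mathbf{J} = v F \boldsymbol{\xi}/\sqrt{N} - (v\lambda/N)\mathbf{J}$ with $F = y - x$, the $1/\sqrt{N}$ normalization of the activations, and the error definition $E_g = |\mathbf{J} - \mathbf{B}|^2/2N$ for a linear teacher $\mathbf{B}$ are all assumptions chosen for illustration.

```python
import math
import random


def simulate_average_learning(n=20, alpha=2.0, v=0.5, lam=0.2,
                              t_settle=20.0, tau=10.0, seed=0):
    """On-line Adaline learning on a recycled training set, followed by
    averaging of the weight vector over a monitoring period tau.

    Returns (E_g of the averaged weights, time-averaged instantaneous E_g).
    """
    rng = random.Random(seed)
    sqrt_n = math.sqrt(n)
    p = int(alpha * n)  # number of examples in the training set
    teacher = [rng.gauss(0.0, 1.0) for _ in range(n)]
    examples = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    targets = [sum(b * x for b, x in zip(teacher, xi)) / sqrt_n
               for xi in examples]

    def gen_error(w):
        # Linear student and teacher with unit-variance Gaussian inputs.
        return sum((a - b) ** 2 for a, b in zip(w, teacher)) / (2.0 * n)

    w = [0.0] * n
    w_sum = [0.0] * n
    eg_sum, count = 0.0, 0
    settle_steps, monitor_steps = int(t_settle * n), int(tau * n)
    for step in range(settle_steps + monitor_steps):
        mu = rng.randrange(p)  # random recycling of examples
        xi = examples[mu]
        force = targets[mu] - sum(a * b for a, b in zip(w, xi)) / sqrt_n
        w = [a + v * force * b / sqrt_n - v * lam * a / n
             for a, b in zip(w, xi)]
        if step >= settle_steps:  # monitoring period
            w_sum = [s + a for s, a in zip(w_sum, w)]
            eg_sum += gen_error(w)
            count += 1
    w_avg = [s / count for s in w_sum]
    return gen_error(w_avg), eg_sum / count


eg_averaged, eg_instantaneous = simulate_average_learning()
print(eg_averaged, eg_instantaneous)
```

Because the generalization error is a convex function of the weights in this sketch, the error of the averaged weight vector cannot exceed the time average of the instantaneous errors, which is consistent with the reduction described in the text.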
FIG. 4: The steady-state generalization error $E_g$ averaged over different monitoring times at $\alpha = 2$
and $\lambda = 0.2$. Symbols: simulations averaged over 10 samples with $N = 1000$, $\tau = 0$ (•), 2 (■), 5
(♦), 10 (▲) and 20 (▼). Corresponding lines: theory.



V. CONCLUSION 



We have analysed the dynamics of on-line learning with restricted sets of examples,
which are randomly recycled. Using the cavity approach, we can derive equations for the 
macroscopic parameters describing the learning dynamics. They are solvable for linear rules 
such as the Adaline rule, yielding results which agree with simulations. We also show that 
the student in on-line mode can learn as well as that in batch mode, after it is averaged over 
a monitoring period at the steady state. 

Our work represents a step forward from two recent treatments [13, 15]. Compared with
Ref. [13], we have found that a functional relationship exists between the generic and cavity
activations, and made explicit the distribution of activations, which for the Adaline rule is a
superposition of Gaussians, each weighted by a Poisson distribution. As a result, our
framework can be used to analyse the training error $E_t$. More importantly, it has the
potential to be extended to analyse nonlinear and multilayer networks. Compared with Ref. [15],
we have based our analysis on a more physical picture. For example, we have made explicit
the difference between the active and passive averages involving the activation of an example
and its learning force at a previous instant. This enables us to analyse correctly the network
behavior at large learning rates, where the mean-force approximation does not apply. The
physical insights will be useful when analytical approximation schemes are devised for more
complex networks.

We have achieved our objective of benchmarking the cavity approach using Adaline learn- 
ing. The next step will be to extend the method to more complicated situations such as 
nonlinear and multilayer networks. We may devise efficient Monte Carlo sampling pro- 
cedures to solve numerically the equations for the Green's functions, the teacher-student 
correlations and the student autocorrelations, making use of the Poisson distribution of the 
learning events for a single example, the Gaussian distribution of the cavity activations, and 
their causal relations with the generic student activations. 

The cavity analysis of linear networks is also the foundation of approximate descriptions 
of the stationary behavior of nonlinear and multilayer networks. One may describe the 
steady state by fluctuations about an averaged state. The fluctuations can be approximated 
by linear deviations which can be analysed by the cavity approach analogous to the Adaline 
benchmark. Present work in progress is moving along this direction. 

Recently, various approximation schemes have been proposed in different learning regimes 
of complicated networks [|1^, yielding results with varying degree of success. For 

example, the conditionally Gaussian approximation cannot capture the sharp peaks in the 
activation distribution developed at the late learning stage of nonlinear rules (e.g. Adatron 
learning). On the other hand, from the perspectives of the cavity method, the nonlinear 
mapping from the Gaussian cavity activations to the generic activations can lead to sharp 
peaks in a natural way p . It is hoped that the cavity framework can provide useful insights 
to improved approximations in the future. 
APPENDIX A: SEQUENCE AVERAGE OF THE STUDENT ACTIVATION



According to the Eq. (^) and Eq. (|T6D, the sequence average of the student activation 
of an example is 

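$$ \langle x(t)\rangle_\sigma = h(t) + v\sum_{m=1}^{\infty}\frac{e^{-t/\alpha}}{\alpha^{m}}\int_0^{t}dt_1\int_0^{t_1}dt_2\cdots\int_0^{t_{m-1}}dt_m\sum_{r=1}^{m}G(t,t_r)F(t_r). \tag{A1} $$

Reordering the summations and the integrations before and after $t_r$,

$$ \langle x(t)\rangle_\sigma = h(t) + v\sum_{m=1}^{\infty}\frac{e^{-t/\alpha}}{\alpha^{m}}\sum_{r=1}^{m}\int_0^{t}dt_r\int_{t_r}^{t}dt_{r-1}\cdots\int_{t_2}^{t}dt_1\left[\int_0^{t_r}dt_{r+1}\cdots\int_0^{t_{m-1}}dt_m\,G(t,t_r)F(t_r)\right], \tag{A2} $$

where we have made implicit the dependence of $F(t_r)$ on the previous learning instants
$t_{r+1},\dots,t_m$. Since the integrand in the square bracket does not depend on the values of
$t_1,\dots,t_{r-1}$, the integration over these variables simply gives the factor $(t-t_r)^{r-1}/(r-1)!$.
We further note that

$$ \sum_{m=1}^{\infty}\sum_{r=1}^{m}\frac{e^{-t/\alpha}}{\alpha^{m}}\,(\cdots) = \frac{1}{\alpha}\sum_{r-1=0}^{\infty}\frac{e^{-(t-t_r)/\alpha}}{\alpha^{r-1}}\sum_{m-r=0}^{\infty}\frac{e^{-t_r/\alpha}}{\alpha^{m-r}}\,(\cdots), \tag{A3} $$

and then derive

$$ \langle x(t)\rangle_\sigma = h(t) + \frac{v}{\alpha}\int_0^{t}dt_r\,G(t,t_r)\left[\sum_{m-r=0}^{\infty}\frac{e^{-t_r/\alpha}}{\alpha^{m-r}}\int_0^{t_r}dt_{r+1}\cdots\int_0^{t_{m-1}}dt_m\,F(t_r)\right]. \tag{A4} $$
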

From Eq. (p3D , the summation in the square bracket is just the sequence average of the 
activation at time tr, yielding Eq. 
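
The counting argument of this appendix can be checked numerically in a simplified setting. The sketch below is an illustration under a strong assumption: the product $G(t,t_r)F(t_r)$ is replaced by a fixed, history-independent kernel (here $e^{-(t-s)}$, chosen arbitrarily), so it tests only the statement that summing over Poisson-distributed learning instants and averaging over sequences amounts to integrating with weight $1/\alpha$. The function names are hypothetical.

```python
import math
import random


def summed_response(kernel, t, alpha, rng):
    """Sum kernel(t_r) over the learning instants t_r of one sequence.

    Instants are generated forward in time with exponential gaps of mean
    alpha, which is equivalent to a Poisson process of rate 1/alpha.
    """
    total = 0.0
    instant = rng.expovariate(1.0 / alpha)
    while instant < t:
        total += kernel(instant)
        instant += rng.expovariate(1.0 / alpha)
    return total


def sequence_average(kernel, t, alpha, n_samples, rng):
    """Monte Carlo estimate of the sequence average of the summed response."""
    return sum(summed_response(kernel, t, alpha, rng)
               for _ in range(n_samples)) / n_samples


t, alpha = 3.0, 1.5
kernel = lambda s: math.exp(-(t - s))  # assumed stand-in for G(t, s) F(s)
estimate = sequence_average(kernel, t, alpha, 20000, random.Random(0))
exact = (1.0 - math.exp(-t)) / alpha   # (1/alpha) * integral of the kernel
print(estimate, exact)
```

In the actual derivation the force at $t_r$ depends on the earlier learning instants, which is why the square bracket in the text is itself a sequence average rather than a fixed function.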



APPENDIX B: SEQUENCE AVERAGE OF ACTIVE AND PASSIVE CORRE- 
LATIONS 



From Eqs. ( |16|) and (p3|), the active sequence average at time s of the activation of example 
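$\mu$ that is learned at time $t$ is

$$ \langle x_\mu(s)\rangle_{\sigma|\sigma(t)=\mu} = h_\mu(s) + v\sum_{m=0}^{\infty}\frac{e^{-t/\alpha}}{\alpha^{m}}\int_0^{t}dt_1\cdots\int_0^{t_{m-1}}dt_m\sum_{n=0}^{\infty}\frac{e^{-(s-t)/\alpha}}{\alpha^{n}}\int_t^{s}dt_{m+1}\cdots\int_{t_{m+n-1}}^{s}dt_{m+n}\left[\sum_{r=1}^{m+n}G(s,t_r)F_\mu(t_r) + G(s,t)F_\mu(t)\right]. \tag{B1} $$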
Using arguments similar to those in Appendix A, the average for the Adaline rule can be written
as a self-consistent equation for $s > t$,

$$ \langle x_\mu(s)\rangle_{\sigma|\sigma(t)=\mu} = h_\mu(s) + \frac{v}{\alpha}\int_0^{t}dt'\,G(s,t')\langle F_\mu(t')\rangle_\sigma + vG(s,t)\langle F_\mu(t)\rangle_\sigma + \frac{v}{\alpha}\int_t^{s}dt'\,G(s,t')\left[y_\mu - \langle x_\mu(t')\rangle_{\sigma|\sigma(t)=\mu}\right]. \tag{B2} $$



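Note that $\langle x_\mu(s)\rangle_{\sigma|\sigma(t)=\mu} = \langle x_\mu(s)\rangle_\sigma$ when $s < t$. Subtracting the corresponding
equation for the passive average $\langle x_\mu(s)\rangle_\sigma$ from the above equation, one can derive

$$ \langle x_\mu(s)\rangle_{\sigma|\sigma(t)=\mu} - \langle x_\mu(s)\rangle_\sigma = vG(s,t)\langle F_\mu(t)\rangle_\sigma - \frac{v}{\alpha}\int_0^{s}dt'\,G(s,t')\left[\langle x_\mu(t')\rangle_{\sigma|\sigma(t)=\mu} - \langle x_\mu(t')\rangle_\sigma\right]. \tag{B3} $$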

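If we multiply both sides of Eq. (16) at time $s$ by $F_\mu(t)$, and perform sequence averages
analogous to Eqs. (B1) and (B2), we obtain equations for the active and passive averages of
the activation-force correlation,

$$ \langle x_\mu(s)F_\mu(t)\rangle_{\sigma|\sigma(t)=\mu} = h_\mu(s)\langle F_\mu(t)\rangle_\sigma + \frac{v}{\alpha}\int_0^{t}dt'\,G(s,t')\langle F_\mu(t')F_\mu(t)\rangle_{\sigma|\sigma(t')=\mu} + vG(s,t)\langle F_\mu^2(t)\rangle_\sigma + \frac{v}{\alpha}\int_t^{s}dt'\,G(s,t')\langle F_\mu(t')F_\mu(t)\rangle_{\sigma|\sigma(t)=\mu}, \tag{B4} $$

$$ \langle x_\mu(s)F_\mu(t)\rangle_\sigma = h_\mu(s)\langle F_\mu(t)\rangle_\sigma + \frac{v}{\alpha}\int_0^{t}dt'\,G(s,t')\langle F_\mu(t')F_\mu(t)\rangle_{\sigma|\sigma(t')=\mu} + \frac{v}{\alpha}\int_t^{s}dt'\,G(s,t')\langle F_\mu(t')F_\mu(t)\rangle_\sigma. \tag{B5} $$

Subtracting Eq. (B5) from Eq. (B4) leads to

$$ W_\mu(s,t) = vG(s,t)\langle F_\mu^2(t)\rangle_\sigma - \frac{v}{\alpha}\int_0^{s}dt'\,G(s,t')W_\mu(t',t). \tag{B6} $$
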

Multiplying both sides of Eq. (B6) by the example Green's function $D(r, s)$, integrating over
s, and applying Eq. ([121) , one obtains Eq. (^3[). 

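Next, one replaces $\langle F_\mu(t')F_\mu(t)\rangle_{\sigma|\sigma(t')=\mu}$ on the right-hand side of Eq. (B5)
with $\langle F_\mu(t')F_\mu(t)\rangle_\sigma - v\int_{t'}^{t}dt_1\,D(t,t_1)G(t_1,t')\langle F_\mu^2(t')\rangle_\sigma$, and arrives at

$$ \langle x_\mu(s)F_\mu(t)\rangle_\sigma = h_\mu(s)\langle F_\mu(t)\rangle_\sigma + \frac{v}{\alpha}\int_0^{s}dt'\,G(s,t')\langle y_\mu F_\mu(t)\rangle_\sigma - \frac{v}{\alpha}\int_0^{s}dt'\,G(s,t')\langle x_\mu(t')F_\mu(t)\rangle_\sigma - \frac{v^2}{\alpha}\int_0^{t}dt_1\int_{t_1}^{t}dt_2\,G(s,t_1)D(t,t_2)G(t_2,t_1)\langle F_\mu^2(t_1)\rangle_\sigma. \tag{B7} $$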



Multiplying both sides by $D_\mu(r,s)$, integrating over $s$ and applying Eq. (12), one finally
reaches Eq. (0). 